Synchronization for document narration

ABSTRACT

Disclosed are techniques and systems for synchronizing an audio file with a sequence of words displayed on a user interface.

This application claims priority from and incorporates herein U.S. Provisional Application No. 61/144,947, filed Jan. 15, 2009, and titled “SYSTEMS AND METHODS FOR SELECTION OF MULTIPLE VOICES FOR DOCUMENT NARRATION” and U.S. Provisional Application No. 61/165,963, filed Apr. 2, 2009, and titled “SYSTEMS AND METHODS FOR SELECTION OF MULTIPLE VOICES FOR DOCUMENT NARRATION.”

BACKGROUND

This invention relates generally to educational and entertainment tools and more particularly to techniques and systems which are used to provide a narration of a text.

Recent advances in computer technology and computer based speech synthesis have opened various possibilities for the artificial production of human speech. A computer system used for artificial production of human speech can be called a speech synthesizer. One type of speech synthesizer is a text-to-speech (TTS) system, which converts normal language text into speech.

SUMMARY

Educational and entertainment tools and more particularly techniques and systems which are used to provide a narration of a text are described herein.

Systems, software and methods enabling a user to select different voice models to apply to different portions of text such that when the system reads the text the different portions are read using the different voice models are described herein.

In some aspects, a computer implemented method includes applying speech recognition by one or more computer systems to an audio recording to generate a text version of recognized words in the audio recording. The method also includes determining by the one or more computer systems an elapsed time period from the start of the audio recording to each word in the sequence of words in the audio recording. The method also includes comparing by the one or more computer systems the words in the text version of the recognized words in the audio recording to the words in a sequence of expected words. The method also includes generating by the one or more computer systems a word timing file comprising the elapsed time information for each word in the sequence of expected words by outputting the elapsed time information for a particular word into the word timing file if the recognized word in the text version of the recognized words matches the expected word and correcting a particular word and associating the elapsed time information with the particular word if the particular word in the text version of the recognized words does not match the expected word. Embodiments may also include devices, software, components, and/or systems to perform any features described herein.
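The comparison step in the method above can be illustrated with a minimal sketch. It assumes that the speech recognizer has already produced a list of (word, elapsed seconds) pairs; the function name build_word_timings and the use of Python's difflib to align the two word sequences are illustrative assumptions and are not taken from this description.

```python
from difflib import SequenceMatcher


def build_word_timings(recognized, expected):
    """Pair each expected word with an elapsed time from the recognizer.

    recognized: list of (word, elapsed_seconds) pairs (assumed recognizer output).
    expected:   list of words in the expected text.
    Returns a list of (expected_word, elapsed_seconds or None).
    """
    rec_words = [word.lower() for word, _ in recognized]
    exp_words = [word.lower() for word in expected]
    timings = [None] * len(expected)

    matcher = SequenceMatcher(a=rec_words, b=exp_words, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Recognized word matches the expected word: output its time.
            for k in range(i2 - i1):
                timings[j1 + k] = recognized[i1 + k][1]
        elif tag == "replace":
            # Mismatch: keep the expected word (the correction) and
            # associate it with the time of the misrecognized word.
            for k in range(min(i2 - i1, j2 - j1)):
                timings[j1 + k] = recognized[i1 + k][1]
        # Words only in one sequence ("insert"/"delete") get no time here.

    return list(zip(expected, timings))
```

In this sketch, a recognizer result of "the quack brown fox" compared against the expected text "The quick brown fox" yields an entry for "quick" that carries the time at which "quack" was recognized. Expected words with no recognized counterpart are left without a time, which a fuller implementation might interpolate.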

The details of one or more embodiments of the invention are set forth in the accompanying drawings and the description below. Other features, objects, and advantages of the invention will be apparent from the description and drawings, and from the claims.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of a system for producing speech-based output from text.

FIG. 2 is a screenshot depicting text.

FIG. 3 is a screenshot of text that includes highlighting of portions of the text based on a narration voice.

FIG. 4 is a flow chart of a voice painting process.

FIG. 5 is a screenshot of a character addition process.

FIG. 6 is a flow chart of a character addition process.

FIG. 7 is a diagram of text with tagged narration data.

FIG. 8 is a screenshot of text with tagged narration information.

FIG. 9 is a diagram of text with highlighting.

FIG. 10 is a flow chart of a synchronization process.

FIG. 11 is a screenshot of a book view of text.

FIG. 12 is a screenshot of text.

FIG. 13 is a screenshot of text.

DETAILED DESCRIPTION

Referring now to FIG. 1, a system 10 for producing speech-based output from text is shown to include a computer 12. The computer 12 is generally a personal computer or can alternatively be another type of device, e.g., a cellular phone that includes a processor (e.g., CPU). Examples of such cell-phones include an iPhone® (Apple, Inc.). Other devices include an iPod® (Apple, Inc.), a handheld personal digital assistant, a tablet computer, a digital camera, an electronic book reader, etc. In addition to a processor, the device includes a main memory and a cache memory and interface circuits, e.g., bus and I/O interfaces (not shown). The computer system 12 includes a mass storage element 16, here typically the hard drive associated with personal computer systems or other types of mass storage, Flash memory, ROM, PROM, etc.

The system 10 further includes a standard PC type keyboard 18, a standard monitor 20 as well as speakers 22, a pointing device such as a mouse and optionally a scanner 24, all coupled to various ports of the computer system 12 via appropriate interfaces and software drivers (not shown). The computer system 12 can operate under a Microsoft Windows operating system although other systems could alternatively be used.

Resident on the mass storage element 16 is narration software 30 that controls the narration of an electronic document stored on the computer 12 (e.g., controls generation of speech and/or audio that is associated with (e.g., narrates) text in a document). Narration software 30 includes edit software 30a that allows a user to edit a document and assign one or more voices or audio recordings to text (e.g., sequences of words) in the document and can include playback software 30b that reads aloud the text from the document, as the text is displayed on the computer's monitor 20 during a playback mode.

Text is narrated by the narration software 30 using several possible technologies: text-to-speech (TTS); audio recording of speech; and possibly in combination with speech, audio recordings of music (e.g., background music) and sound effects (e.g., brief sounds such as gunshots, door slamming, tea kettle boiling, etc.). The narration software 30 controls generation of speech, by controlling a particular computer voice (or audio recording) stored on the computer 12, causing that voice to be rendered through the computer's speakers 22. Narration software often uses a text-to-speech (TTS) voice which artificially synthesizes a voice by converting normal language text into speech. TTS voices vary in quality and naturalness. Some TTS voices are produced by synthesizing the sounds for speech using rules in a way which results in a voice that sounds artificial, and which some would describe as robotic. Another way to produce TTS voices concatenates small parts of speech which were recorded from an actual person. This concatenated TTS sounds more natural. Another way to narrate, other than TTS, is to play an audio recording of a person reading the text, such as, for example, a book on tape recording. The audio recording may include more than one actor speaking, and may include other sounds besides speech, such as sound effects or background music. Additionally, the computer voices can be associated with different languages (e.g., English, French, Spanish, Cantonese, Japanese, etc.).

In addition, the narration software 30 permits the user to select and optionally modify a particular voice model which defines and controls aspects of the computer voice, including for example, the speaking speed and volume. The voice model includes the language of the computer voice. The voice model may be selected from a database that includes multiple voice models to apply to selected portions of the document. A voice model can have other parameters associated with it besides the voice itself and the language, speed and volume, including, for example, gender (male or female), age (e.g. child or adult), voice pitch, visual indication (such as a particular color of highlighting) of document text that is associated with this voice model, emotion (e.g. angry, sad, etc.), intensity (e.g. mumble, whisper, conversational, projecting voice as at a party, yell, shout). The user can select different voice models to apply to different portions of text such that when the system 10 reads the text the different portions are read using the different voice models. The system can also provide a visual indication, such as highlighting, of which portions are associated with which voice models in the electronic document.
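One way to picture a voice model with the parameters listed above is as a simple record stored in a database keyed by name. The following is a minimal sketch only; the field names, the default values (for example 150 words per minute), and the sample entries are assumptions made for illustration and are not specified in this description.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class VoiceModel:
    voice: str                             # TTS voice or recording to use
    language: str = "English"
    speed_wpm: int = 150                   # assumed default speaking speed
    volume: float = 1.0                    # relative to a baseline volume
    gender: Optional[str] = None           # e.g. "male" or "female"
    age: Optional[str] = None              # e.g. "child" or "adult"
    pitch: Optional[float] = None
    highlight_color: Optional[str] = None  # visual indication in the text
    emotion: Optional[str] = None          # e.g. "angry", "sad"
    intensity: Optional[str] = None        # e.g. "whisper", "shout"


# Hypothetical database of stored voice models.
VOICE_MODELS = {
    "narrator": VoiceModel(voice="default"),
    "grandma": VoiceModel(voice="elderly-female", gender="female",
                          age="adult", highlight_color="yellow"),
}


def customized(name, **changes):
    """Return a modified copy of a stored model; the stored one is unchanged."""
    return replace(VOICE_MODELS[name], **changes)
```

For example, customized("grandma", speed_wpm=110, volume=0.8) would give a slower, quieter variant while the stored "grandma" model keeps its original settings.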

Referring to FIG. 2, text 50 is rendered on a user display 51. As shown, the text 50 includes only words and does not include images. However, in some examples, the text could include portions that are composed of images and portions that are composed of words. The text 50 is a technical paper, namely, “The Nature and Origin of Instructional Objects.” Exemplary texts include, but are not limited to, electronic versions of books, word processor documents, PDF files, electronic versions of newspapers, magazines, fliers, pamphlets, menus, scripts, plays, and the like. The system 10 can read the text using one or more stored voice models. In some examples, the system 10 reads different portions of the text 50 using different voice models. For example, if the text includes multiple characters, a listener may find listening to the text more engaging if different voices are used for each of the characters in the text rather than using a single voice for the entire narration of the text. In another example, extremely important or key points could be emphasized by using a different voice model to recite those portions of the text.

As used herein a “character” refers to an entity and is typically stored as a data structure or file, etc. on computer storage media and includes a graphical representation, e.g., picture, animation, or another graphical representation of the entity and which may in some embodiments be associated with a voice model. A “mood” refers to an instantiation of a voice model according to a particular “mood attribute” that is desired for the character. A character can have multiple associated moods. “Mood attributes” can be various attributes of a character. For instance, one attribute can be “normal,” other attributes include “happy,” “sad,” “tired,” “energetic,” “fast talking,” “slow talking,” “native language,” “foreign language,” “hushed voice,” “loud voice,” etc. Mood attributes can include varying features such as speed of playback, volumes, pitch, etc. or can be the result of recording different voices corresponding to the different moods.

For example, the character “Homer Simpson” includes a graphical depiction of Homer Simpson and a voice model that replicates a voice associated with Homer Simpson. Homer Simpson can have various moods (flavors or instantiations of voice models of Homer Simpson) that emphasize one or more attributes of the voice for the different moods. For example, one passage of text can be associated with a “sad” Homer Simpson voice model, another with a “happy” Homer Simpson voice model, and a third with a “normal” Homer Simpson voice model.

Referring to FIG. 3, the text 50 is rendered on a user display 51 with the addition of a visual indicium (e.g., highlighting) on different portions of the text (e.g., portions 52, 53, and 54). The visual indicium (or lack of an indicium) indicates portions of the text that have been associated with a particular character or voice model. The visual indicium is in the form of, for example, a semi-transparent block of color over portions of the text, a highlighting, a different color of the text, a different font for the text, underlining, italicizing, or other visual indications (indicia) to emphasize different portions of the text. For example, in text 50 portions 52 and 54 are highlighted in a first color while another portion 53 is not highlighted. When the system 10 generates the narration of the text 50, different voice models are applied to the different portions associated with different characters or voice models that are represented visually by the text having a particular visual indicium. For example, a first voice model will be used to read the first portions 52 and 54 while a second voice model (a different voice model) will be used to read the portion 53 of the text.

In some examples, text has some portions that have been associated with a particular character or voice model and others that have not. This is represented visually on the user interface as some portions exhibiting a visual indicium and others not exhibiting a visual indicium (e.g., the text includes some highlighted portions and some non-highlighted portions). A default voice model can be used to provide the narration for the portions that have not been associated with a particular character or voice model (e.g., all non-highlighted portions). For example, in a typical story much of the text relates to describing the scene and not to actual words spoken by characters in the story. Such non-dialog portions of the text may remain non-highlighted and not associated with a particular character or voice model. These portions can be read using the default voice (e.g., a narrator's voice) while the dialog portions may be associated with a particular character or voice model (and indicated by the highlighting) such that a different, unique voice is used for dialog spoken by each character in the story.

FIG. 3 also shows a menu 55 used for selection of portions of a text to be read using different voice models. A user selects a portion of the text by using an input device such as a keyboard or mouse to select a portion of the text, or, on devices with a touchscreen, a finger or stylus pointing device may be used to select text. Once the user has selected a portion of the text, a drop down menu 55 is generated that provides a list of the different available characters (e.g., characters 56, 58, and 60) that can be used for the narration. A character need not be related directly to a particular character in a book or text, but rather provides a specification of the characteristics of a particular voice model that is associated with the character. For example, different characters may have male versus female voices, may speak in different languages or with different accents, may read more quickly or slowly, etc. The same character can be associated with multiple different texts and can be used to read portions of the different texts.

Each character 56, 58, and 60 is associated with a particular voice model and with additional characteristics of the reading style of the character such as language, volume, speed of narration. By selecting (e.g., using a mouse or other input device to click on) a particular character 56, 58, or 60, the selected portion of the text is associated with the voice model for the character and will be read using the voice model associated with the character.

Additionally, the drop down menu includes a “clear annotation” button 62 that clears previously applied highlighting and returns the portion of text to non-highlighted such that it will be read by the Narrator rather than one of the characters. The Narrator is a character whose initial voice is the computer's default voice, though this voice can be overridden by the user. All of the words in the document or text can initially be associated with the Narrator. If a user selects text that is associated with the Narrator, the user can then perform an action (e.g. select from a menu) to apply another one of the characters for the selected portion of text. To return a previously highlighted portion to being read by the Narrator, the user can select the “clear annotation” button 62.

In order to make selection of the character more user friendly, the drop down menu 55 can include an image (e.g., images 57, 59, and 61) of the character. For example, if one of the character voices is similar to the voice of the Fox television cartoon character Homer Simpson (e.g., character 58), an image of Homer Simpson (e.g., image 59) could be included in the drop down menu 55. Inclusion of the images is believed to make selection of the desired voice model to apply to different portions of the text more user friendly.

Referring to FIG. 4, a process 100 for selecting different characters or voice models to be used when the system 10 reads a text is shown. The system 10 displays 102 the text on a user interface. In response to a user selection, the system 10 receives 104 a selection of a portion of the text and displays 106 a menu of available characters each associated with a particular voice model. In response to a user selecting a particular character (e.g., by clicking on the character from the menu), the system receives 108 the user selected character and associates the selected portion of the text with the voice model for the character. The system 10 also generates a highlight 110 or generates some other type of visual indication to apply to the portion of the text to indicate that that portion of text is associated with a particular voice model and will be read using the particular voice model when the user selects to hear a narration of the text. The system 10 determines 112 if the user is making additional selections of portions of the text to associate with particular characters. If the user is making additional selections of portions of the text, the system returns to receiving 104 the user's selection of portions of the text, displays 106 the menu of available characters, receives a user selection and generates a visual indication to apply to a subsequent portion of text.
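The bookkeeping behind process 100 can be sketched as a per-word association between the text and a character, with every word starting out associated with the Narrator and a clear operation that restores that default. The class name VoiceAssignments, the word-index based selection, and the runs method are assumptions made for this illustration; the description does not prescribe a particular data structure.

```python
from itertools import groupby

NARRATOR = "Narrator"  # default character for text with no other association


class VoiceAssignments:
    def __init__(self, words):
        self.words = list(words)
        # Initially every word is associated with the Narrator.
        self.characters = [NARRATOR] * len(self.words)

    def assign(self, start, end, character):
        """Associate words[start:end] with the character chosen from the menu."""
        for i in range(start, min(end, len(self.words))):
            self.characters[i] = character

    def clear(self, start, end):
        """Equivalent of "clear annotation": return the words to the Narrator."""
        self.assign(start, end, NARRATOR)

    def runs(self):
        """Group consecutive words by character as (character, text) pairs."""
        result = []
        index = 0
        for character, group in groupby(self.characters):
            count = len(list(group))
            text = " ".join(self.words[index:index + count])
            result.append((character, text))
            index += count
        return result
```

Each run returned by runs() corresponds to one highlighted (or non-highlighted) portion of the text and identifies the character whose voice model would be used to read it.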

As described above, multiple different characters are associated with different voice models and a user associates different portions of the text with the different characters. In some examples, the characters are predefined and included in a database of characters having defined characteristics. For example, each character may be associated with a particular voice model that includes parameters such as a relative volume, and a reading speed. When the system 10 reads text having different portions associated with different characters, not only can the voice of the characters differ, but other narration characteristics such as the relative volume of the different characters and how quickly the characters read (e.g., how many words per minute) can also differ.

In some embodiments, a character can be associated with multiple voice models. If a character is associated with multiple voice models, the character has multiple moods that can be selected by the user. Each mood has an associated (single) voice model. When the user selects a character the user also selects the mood for the character such that the appropriate voice model is chosen. For example, a character could have multiple moods in which the character speaks in a different language in each of the moods. In another example, a character could have multiple moods based on the type of voice or tone of voice to be used by the character. For example, a character could have a happy mood with an associated voice model and an angry mood using an angry voice with an associated angry voice model. In another example, a character could have multiple moods based on a story line of a text. For example, in the story of the Big Bad Wolf, the wolf character could have a wolf mood in which the wolf speaks in a typical voice for the wolf (using an associated voice model) and a grandma mood in which the wolf speaks in a voice imitating the grandmother (using an associated voice model).
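The relationship described above, in which a character has several moods and each mood maps to exactly one voice model, can be sketched as follows. The class and method names are hypothetical, and voice models are represented here only by string identifiers for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class Character:
    name: str
    # Mood name -> identifier of the single voice model for that mood.
    moods: dict = field(default_factory=dict)
    default_mood: str = "normal"  # assumed fallback when no mood is selected

    def add_mood(self, mood, voice_model):
        self.moods[mood] = voice_model

    def voice_for(self, mood=None):
        """Return the voice model for the selected mood."""
        chosen = mood if mood is not None else self.default_mood
        if chosen not in self.moods:
            raise KeyError(f"{self.name} has no mood {chosen!r}")
        return self.moods[chosen]


# Usage mirroring the wolf example above.
wolf = Character("Wolf", default_mood="wolf")
wolf.add_mood("wolf", "gruff-male-voice")
wolf.add_mood("grandma", "imitated-elderly-female-voice")
```

Selecting the character together with a mood, for example wolf.voice_for("grandma"), then resolves to the one voice model to use for the selected portion of text.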

FIG. 5 shows a screenshot of a user interface 120 on a user display 121 for enabling a user to view the existing characters and modify, delete, and/or generate a character. With the interface, a user generates a cast of characters for the text. Once a character has been generated, the character will be available for associating with portions of the text (e.g., as discussed above). A set of all available characters is displayed in a cast members window 122. In the example shown in FIG. 5, the cast members window 122 includes three characters, a narrator 124, Charlie Brown 126, and Homer Simpson 128. From the cast members window 122 the user can add a new character by selecting button 130, modify an existing character by selecting button 132, and/or delete a character by selecting button 134.

The user interface for generating or modifying a voice model is presented as an edit cast member window 136. In this example, the character Charlie Brown has only one associated voice model to define the character's voice, volume and other parameters, but as previously discussed, a character could be associated with multiple voice models (not shown in FIG. 5). The edit cast member window 136 includes an input portion 144 for receiving a user selection of a mood or character name. In this example, the mood of Charlie Brown has been input into input portion 144. The character name can be associated with the story and/or associated with the voice model. For example, if the voice model emulates the voice of an elderly lady, the character could be named “grandma.”

In another example, if the text which the user is working on is Romeo and Juliet, the user could name one of the characters Romeo and another Juliet and use those characters to narrate the dialog spoken by each of the characters in the play. The edit cast member window 136 also includes a portion 147 for selecting a voice to be associated with the character. For example, the system can include a drop down menu of available voices and the user can select a voice from the drop down menu of voices. In another example, the portion 147 for selecting the voice can include an input block where the user can select and upload a file that includes the voice. The edit cast member window 136 also includes a portion 145 for selecting the color or type of visual indicia to be applied to the text selected by a user to be read using the particular character. The edit cast member window 136 also includes a portion 149 for selecting a volume for the narration by the character.

As shown in FIG. 5, a sliding scale is presented and a user moves a slider on the sliding scale to indicate a relative increase or decrease in the volume of the narration by the corresponding character. In some additional examples, a drop down menu can include various volume options such as very soft, soft, normal, loud, very loud. The edit cast member window 136 also includes a portion 146 for selecting a reading speed for the character. The reading speed provides an average number of words per minute that the computer system will read at when the text is associated with the character. As such, the portion for selecting the reading speed modifies the speed at which the character reads. The edit cast member window 136 also includes a portion 138 for associating an image with the character. This image can be presented to the user when the user selects a portion of the text to associate with a character (e.g., as shown in FIG. 3). The edit cast member window 136 can also include an input for selecting the gender of the character (e.g., as shown in block 140) and an input for selecting the age of the character (e.g., as shown in block 142). Other attributes of the voice model can be modified in a similar manner.

Referring to FIG. 6, a process 150 for generating elements of a character and its associated voice model is shown. The system displays 152 a user interface for adding a character. The user inputs information to define the character and its associated voice model. While this information is shown as being received in a particular order in the flow chart, other orders can be used. Additionally, the user may not provide each piece of information and the associated steps may be omitted from the process 150.

After displaying the user interface for adding a character, the system receives 154 a user selection of a character name. For example, the user can type the character name into a text box on the user interface. The system also receives 156 a user selection of a computer voice to associate with the character. The voice can be an existing voice selected from a menu of available voices or can be a voice stored on the computer and uploaded at the time the character is generated. The system also receives 158 a user selection of a type of visual indicia or color for highlighting the text in the document when the text is associated with the character. For example, the visual indicium or color can be selected from a list of available colors which have not been previously associated with another character. The system also receives 160 a user selection of a volume for the character. The volume will provide the relative volume of the character in comparison to a baseline volume. The system also receives 162 a user selection of a speed for the character's reading. The speed will determine the average number of words per minute that the character will read when narrating a text. The system stores 164 each of the inputs received from the user in a memory for later use. If the user does not provide one or more of the inputs, the system uses a default value for the input. For example, if the user does not provide a volume input, the system defaults to an average volume.
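The handling of omitted inputs in process 150 can be sketched as a function that fills in defaults and picks a highlight color not already used by another character. The palette, the default values, and the dictionary-based storage below are illustrative assumptions; the description states only that defaults are used and that colors are chosen from those not previously associated with another character.

```python
# Hypothetical palette and default values (not specified in the description).
AVAILABLE_COLORS = ["yellow", "green", "blue", "pink"]
DEFAULTS = {"voice": "default", "volume": 1.0, "speed_wpm": 150}


def create_character(cast, name, voice=None, color=None, volume=None,
                     speed_wpm=None):
    """Store a new character in `cast`, using defaults for omitted inputs."""
    used = {member["color"] for member in cast.values()}
    if color is None:
        # Pick the first color not yet associated with another character.
        unused = [c for c in AVAILABLE_COLORS if c not in used]
        color = unused[0] if unused else None
    elif color in used:
        raise ValueError(f"color {color!r} is already in use")

    character = {
        "name": name,
        "voice": voice if voice is not None else DEFAULTS["voice"],
        "color": color,
        "volume": volume if volume is not None else DEFAULTS["volume"],
        "speed_wpm": speed_wpm if speed_wpm is not None else DEFAULTS["speed_wpm"],
    }
    cast[name] = character  # corresponds to storing the inputs for later use
    return character
```

Calling create_character(cast, "Homer Simpson") with no further inputs would therefore yield a character with the default voice, an average volume, and the next unused highlight color.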

Different characters can be associated with voice models for different languages. For example, if a text included portions in two different languages, it can be beneficial to select portions of the text and have the system read the text in the first language using a first character with a voice model in the first language and read the portion in the second language using a second character with a voice model in the second language. In applications in which the system uses a text-to-speech application in combination with a stored voice model to produce computer generated speech, it can be beneficial for the voice models to be language specific in order for the computer to correctly pronounce and read the words in the text.

For example, text can include a dialog between two different characters that speak in different languages. In this example, the portions of the dialog spoken by a character in a first language (e.g., English) are associated with a character (and associated voice model) that has a voice model associated with the first language (e.g., a character that speaks in English). Additionally, the portions of the dialog in a second language (e.g., Spanish) are associated with a character (and associated voice model) that speaks in the second language (e.g., Spanish). As such, when the system reads the text, portions in the first language (e.g., English) are read using the character with an English-speaking voice model and portions of the text in the second language (e.g., Spanish) are read using a character with a Spanish-speaking voice model.

For example, different characters with voice models can be used to read an English as a second language (ESL) text in which it can be beneficial to read some of the portions using an English-speaking character and other portions using a foreign language-speaking character. In this application, the portions of the ESL text written in English are associated with a character (and associated voice model) that is an English-speaking character. Additionally, the portions of the text in the foreign (non-English) language are associated with a character (and associated voice model) that is a character speaking the particular foreign language. As such, when the system reads the text, portions in English are read using a character with an English-speaking voice model and portions of the text in the foreign language are read using a character with a voice model associated with the foreign language.

While in the examples described above, a user selected portions of a text in a document to associate the text with a particular character such that the system would use the voice model for the character when reading that portion of the text, other techniques for associating portions of text with a particular character can be used. For example, the system could interpret text-based tags in a document as an indicator to associate a particular voice model with associated portions of text.

Referring to FIG. 7, a portion of an exemplary document rendered on a user display 171 that includes text-based tags is shown. Here, the actors' names are written inside square brackets (using a technique that is common in theatrical play scripts). Each line of text has a character name associated with the text. The character name is set out from the text of the story or document with a set of brackets or other computer recognizable indicator such as the pound key, an asterisk, parentheses, a percent sign, etc. For example, the first line 172 shown in document 170 includes the text “[Henry] Hi Sally!” and the second line 174 includes the text “[Sally] Hi Henry, how are you?” Henry and Sally are both characters in the story and character models can be generated to associate a voice model, volume, reading speed, etc. with the character, for example, using the methods described herein. When the computer system reads the text of document 170, the computer system recognizes the text in brackets, e.g., [Henry] and [Sally], as an indicator of the character associated with the following text and will not read the text included within the brackets. As such, the system will read the first line “Hi Sally!” using the voice model associated with Henry and will read the second line “Hi Henry, how are you?” using the voice model associated with Sally.
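The interpretation of bracketed tags can be sketched with a short parser that separates the character name from the text to be spoken. The function name and the regular expression are illustrative. The handling of untagged lines (they inherit the most recent tag, or a default "Narrator" before any tag appears) is an assumption and is not stated in this description.

```python
import re

# A tag is a character name in square brackets at the start of a line.
TAG_PATTERN = re.compile(r"^\[([^\]]+)\]\s*(.*)$")


def parse_tagged_lines(lines, default_character="Narrator"):
    """Return (character, spoken_text) pairs; the tag itself is never spoken."""
    parsed = []
    current = default_character
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue  # ignore blank lines
        match = TAG_PATTERN.match(stripped)
        if match:
            current = match.group(1).strip()
            parsed.append((current, match.group(2)))
        else:
            # Assumption: an untagged line continues the preceding character.
            parsed.append((current, stripped))
    return parsed
```

Applied to the two lines of document 170, this sketch pairs "Hi Sally!" with Henry and "Hi Henry, how are you?" with Sally, so that each line can be read using the voice model of the tagged character.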

Using the tags to indicate the character to associate with different portions of the text can be beneficial in some circumstances. For example, if a student is given an assignment to write a play for an English class, the student's work may go through multiple revisions with the teacher before reaching the final product. Rather than requiring the student to re-highlight the text each time a word is changed, using the tags allows the student to modify the text without affecting the character and voice model associated with the text. For example, in the text of FIG. 7, if the last line was modified to read, “. . . Hopefully you remembered to wear your gloves” from “. . . Hopefully you remembered to wear your hat,” then due to the preceding tag of ‘[Sally]’ the modified text would automatically be read using the voice model for Sally without requiring the user to take additional steps to have the word “gloves” read using the voice model for Sally.

Referring to FIG. 8, a screenshot 180 rendered on a user display 181 of text that includes tagged portions associated with different characters is shown. As described above, the character associated with a particular portion of the text is indicated in brackets preceding the text (e.g., as shown in bracketed text 182, 184 and 186). In some situations, a story may include additional portions that are not to be read as part of the story. For example, in a play, stage motions or lighting cues may be included in the text but should not be spoken when the play is read. Such portions are skipped by the computer system when the computer system is reading the text. A ‘skip’ indicator indicates portions of text that should not be read by the computer system. In the example shown in FIG. 8, a skip indicator 188 is used to indicate that the text “She leans back in her chair” should not be read.
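A skip indicator can be handled in the same pass that interprets character tags. The sketch below assumes, purely for illustration, that the skip indicator is written as a bracketed tag spelled "[skip]"; the actual form of indicator 188 is not specified in this description, and the sample character lines other than the stage direction are invented.

```python
import re

SKIP_TAG = "skip"  # hypothetical spelling of the skip indicator
TAG = re.compile(r"\[([^\]]+)\]")


def spoken_segments(text):
    """Return (tag, text) pairs for segments that should be read aloud.

    Text that follows a skip tag is dropped. Text before the first tag is
    ignored in this sketch.
    """
    # With one capturing group, split() yields
    # [before, tag1, text1, tag2, text2, ...].
    parts = TAG.split(text)
    segments = []
    for i in range(1, len(parts), 2):
        tag = parts[i].strip()
        body = parts[i + 1].strip()
        if tag.lower() != SKIP_TAG and body:
            segments.append((tag, body))
    return segments


script = ("[Sally] I am tired. [skip] She leans back in her chair. "
          "[Sally] Wake me later.")
```

Here spoken_segments(script) keeps both of Sally's lines and omits the stage direction "She leans back in her chair."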

While in the examples above, the user indicated portions of the text to be read using different voice models by either selecting the text or adding a tag to the text, in some examples the computer system automatically identifies text to be associated with different voice models. For example, the computer system can search the text of a document to identify portions that are likely to be quotes or dialog spoken by characters in the story. By determining text associated with dialog in the story, the computer system eliminates the need for the user to independently identify those portions.

Referring to FIG. 9, the computer system searches the text of a story 200 (in this case the story of the Three Little Pigs) to identify the portions spoken by the narrator (e.g., the non-dialog portions). The system associates all of the non-dialog portions with the voice model for the narrator as indicated by the highlighted portions 202, 206, and 210. The remaining dialog-based portions 204, 208, and 212 are associated with different characters and voice models by the user. By pre-identifying the portions 204, 208, and 212 for which the user should select a character, the computer system reduces the amount of time necessary to select and associate voice models with different portions of the story.
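The separation of narrator portions from dialog portions can be sketched by treating any text enclosed in quotation marks as dialog and everything else as narration. This is a simplification that assumes well-formed, non-nested straight or curly double quotes; the function name and the sample sentences around the quotation are illustrative.

```python
import re

# Matches text enclosed in straight or curly double quotation marks.
QUOTE = re.compile(r'"[^"]*"|“[^”]*”')


def split_dialog(text):
    """Split text into ("narrator", ...) and ("dialog", ...) portions."""
    portions = []
    position = 0
    for match in QUOTE.finditer(text):
        narration = text[position:match.start()].strip()
        if narration:
            portions.append(("narrator", narration))
        # Strip the surrounding quotation marks from the dialog.
        portions.append(("dialog", match.group(0)[1:-1]))
        position = match.end()
    tail = text[position:].strip()
    if tail:
        portions.append(("narrator", tail))
    return portions
```

The "narrator" portions can be associated with the narrator's voice model automatically, leaving only the "dialog" portions for the user to assign to characters.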

In some examples, the computer system can step through each of the non-highlighted or non-associated portions and ask the user which character to associate with the quotation. For example, the computer system could recognize that the first portion 202 of the text shown in FIG. 9 is spoken by the narrator because the portion is not enclosed in quotations. When reaching the first set of quotations including the text “Please man give me that straw to build me a house,” the computer system could request an input from the user of which character to associate with the quotation. Such a process could continue until the entire text had been associated with different characters.

In some additional examples, the system automatically selects a character to associate with each quotation based on the words of the text using a natural language process. For example, line 212 of the story shown in FIG. 9 recites “To which the pig answered ‘no, not by the hair of my chinny chin chin.’” The computer system recognizes the quotation “no, not by the hair of my chinny chin chin” based on the text being enclosed in quotation marks. The system reviews the text leading up to or following the quotation for an indication of the speaker. In this example, the text leading up to the quotation states “To which the pig answered”; as such, the system could recognize that the pig is the character speaking this quotation and associate the quotation with the voice model for the pig. In the event that the computer system selects the incorrect character, the user can modify the character selection using one or more of the techniques described herein.
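
As a minimal sketch of the speaker identification described above, the text adjacent to a quotation could be scanned for a known character name. The function `guess_speaker`, its parameters, and the simple keyword lookup (standing in for a fuller natural language process, and limited to single-word names) are assumptions for illustration only.

```python
import re

def guess_speaker(before, after, characters, default=None):
    """Guess which character speaks a quotation from the adjacent text.

    `before` and `after` are the text leading up to and following the
    quotation; `characters` is a list of known single-word character names.
    """
    names = {name.lower(): name for name in characters}
    before_words = re.findall(r"[a-z]+", before.lower())
    after_words = re.findall(r"[a-z]+", after.lower())
    # Prefer the name nearest the quotation: scan the preceding text
    # backward first, then the following text forward.
    for word in list(reversed(before_words)) + after_words:
        if word in names:
            return names[word]
    return default
```

For the example above, `guess_speaker("To which the pig answered", "", ["wolf", "pig"])` selects the pig. When no name is found, the default is returned and the user could be asked to select the character.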

In some embodiments, the voice models associated with the characters can be electronic Text-To-Speech (TTS) voice models. TTS voices artificially produce a voice by converting normal text into speech. In some examples, the TTS voice models are customized based on a human voice to emulate a particular voice. In other examples, the voice models are actual human (as opposed to a computer) voices generated by a human specifically for a document, e.g., high quality audio versions of books and the like. For example, the quality of the speech from a human can be better than the quality of a computer generated, artificially produced voice. While the system narrates text out loud and highlights each word being spoken, some users may prefer that the voice is recorded human speech, and not a computer voice.

In order to efficiently record speech associated with a particular character, the user can pre-highlight the text to be read by the person who is generating the speech and/or use speech recognition software to associate the words read by a user to the locations of the words in the text. The computer system reads the document, pausing and highlighting the portions to be read by the individual. As the individual reads, the system records the audio. In another example, a list of all portions to be read by the individual can be extracted from the document and presented to the user. The user can then read each of the portions while the system records the audio and associates the audio with the correct portion of the text (e.g., by placing markers in an output file indicating a corresponding location in the audio file). Alternatively, the system can provide a location at which the user should read and the system can record the audio and associate the text location with the location in the audio (e.g., by placing markers in the audio file indicating a corresponding location in the document).

In “playback mode”, the system synchronizes the highlighting (or other indicia) of each word as it is being spoken with an audio recording so that each word is highlighted or otherwise visually emphasized on a user interface as it is being spoken, in real time. Referring to FIG. 10, a process 230 for synchronizing the highlighting (or other visual indicia) of each word in an audio recording with a set of expected words so that each word is visually emphasized on a user interface as it is being spoken is shown. The system processes 232 the audio recording using a speech recognition process executed on a computer. The system, using the speech recognition process, generates 234 a time mark (e.g., an indication of an elapsed time period from the start of the audio recording to each word in the sequence of words) for each word and preferably, each syllable, that the speech recognition process recognizes. The system, using the speech recognition process, generates 236 an output file of each recognized word or syllable and the time it was recognized, relative to the start time of the recording (e.g., the elapsed time). Other parameters and measurements can be saved to the file. The system compares 238 the words in the speech recognition output to the words in the original text (e.g., a set of expected words). The comparison process compares one word from the original text at a time. Speech recognition is an imperfect process, so even with a high quality recording like an audio book, there may be errors of recognition. For each word, based on the comparison of the word in the speech recognition output to the expected word in the original text, the system determines whether the word in the speech recognition output matches (e.g., is the same as) the word in the original text. If the word from the original text matches the recognized word, the word is output 240 with the time of recognition to a word timing file. If the words do not match, the system applies 242 a correcting process to find (or estimate) a timing for the original word. The system determines 244 if there are additional words in the original text, and if so, returns to determining 238 whether the word in the speech recognition output matches (e.g., is the same as) the word in the original text. If not, the system ends 246 the synchronization process.
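
The word-by-word comparison of process 230 might be sketched as follows. This is a simplified illustration that assumes the recognized words and the expected words line up one-for-one by position; the names `build_word_timings`, `normalize`, and the `correct` callback (standing in for the correcting process 242) are hypothetical.

```python
def normalize(word):
    """Lower-case a word and drop punctuation before comparing."""
    return "".join(ch for ch in word.lower() if ch.isalnum())

def build_word_timings(expected, recognized, correct):
    """Build word timing entries for each expected word.

    expected:   list of words from the original text
    recognized: list of (word, elapsed_seconds) from speech recognition
    correct:    callback (index, expected, recognized) -> time or None,
                used when the recognized word does not match
    """
    timings = []
    for i, word in enumerate(expected):
        if i < len(recognized) and normalize(recognized[i][0]) == normalize(word):
            # Match: output the expected word with the time of recognition.
            timings.append((word, recognized[i][1]))
        else:
            # Mismatch: find or estimate a timing for the original word.
            timings.append((word, correct(i, expected, recognized)))
    return timings
```

A trivial `correct` callback could reuse the time of the recognized word at the same position, or return `None` so that a later pass estimates the timing.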

The correcting process can use a number of methods to find the correct timing from the speech recognition process or to estimate a timing for the word. For example, the correcting process can iteratively compare the next words until it finds a match between the original text and the recognized text, which leaves it with a known length of mis-matched words. The correcting process can, for example, interpolate the times to get a time that is in-between the first matched word and the last matched word in this length of mis-matched words. Alternatively, if the number of syllables matches in the length of mis-matched words, the correcting process assumes the syllable timings are correct, and sets the timing of the first mis-matched word according to the number of syllables. For example, if the mis-matched word has 3 syllables, the time of that word can be associated with the time from the 3rd syllable in the recognized text.
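
The interpolation described above might, for example, be sketched as follows. The sketch assumes linear interpolation with equal spacing between the matched words on either side of a run of mis-matched words; the function name `interpolate_gaps` and the use of `None` to mark a missing timing are assumptions made for illustration.

```python
def interpolate_gaps(timings):
    """Fill in missing times between matched words by linear interpolation.

    timings: list of (word, elapsed_seconds or None). Entries before the
    first matched word or after the last matched word are left unchanged.
    """
    result = list(timings)
    known = [i for i, (_, t) in enumerate(result) if t is not None]
    for left, right in zip(known, known[1:]):
        gap = right - left
        if gap > 1:
            t0, t1 = result[left][1], result[right][1]
            # Spread the mis-matched words evenly between the two matches.
            for k in range(1, gap):
                word = result[left + k][0]
                result[left + k] = (word, t0 + (t1 - t0) * k / gap)
    return result
```

For instance, two mis-matched words between matched words at 1.0 and 4.0 seconds would be assigned times of approximately 2.0 and 3.0 seconds.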

Another technique involves using linguistic metrics based on measurements of the length of time to speak certain words, syllables, letters and other parts of speech. These metrics can be applied to the original word to provide an estimate for the time needed to speak that word.
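
One simple form of such a metric, shown here only as a sketch, multiplies an approximate syllable count by an average time per syllable. The vowel-group heuristic for counting syllables and the value of 0.2 seconds per syllable are illustrative assumptions, not measured values from the disclosure.

```python
import re

SECONDS_PER_SYLLABLE = 0.2  # illustrative assumption, not a measured value

def count_syllables(word):
    """Rough heuristic: count groups of consecutive vowels (at least one)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def estimate_duration(word, seconds_per_syllable=SECONDS_PER_SYLLABLE):
    """Estimate the time needed to speak a word, in seconds."""
    return count_syllables(word) * seconds_per_syllable
```

A practical system could replace the heuristic with measured per-word or per-syllable durations for a particular speaker or voice model.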

Alternatively, a word timing indicator can be produced by close integration with a speech recognizer. Speech recognition is a complex process which generates many internal measurements, variables and hypotheses. Using these very detailed speech recognition measurements in conjunction with the original text (the text that is known to be spoken) could produce highly accurate hypotheses about the timing of each word. The techniques described above could be used, but with the additional information from the speech recognition engine, better results could be achieved. The existing speech recognition engine would be part of the new word timing indicator.

Additionally, methods of determining the timings of each word could be facilitated by a software tool that provides a user with a visual display of the recognized words, the timings, the original words and other information, preferably in a timeline display. The user would be able to quickly make an educated guess as to the timings of each word using the information on this display. This software tool provides the user with an interface for the user to indicate which word should be associated with which timing, and to otherwise manipulate and correct the word timing file.

Other associations between the location in the audio file and the location in the document can be used. For example, such an association could be stored in a separate file from both the audio file and the document, in the audio file itself, and/or in the document.

In some additional examples, a second type of highlighting, referred to herein as “playback highlighting,” is displayed by the system during playback or reading of a text in order to annotate the text and provide a reading location for the user. This playback highlighting occurs in a playback mode of the system and is distinct from the highlighting that occurs when a user selects text, or the voice painting highlighting that occurs in an editing mode used to highlight sections of the text according to an associated voice model. In this playback mode, for example, as the system reads the text (e.g., using a TTS engine or by playing stored audio), the system tracks the location in the text of the words currently being spoken or produced. The system highlights or applies another visual indicia (e.g., bold font, italics, underlining, a moving ball or other pointer, change in font color) on a user interface to allow a user to more easily read along with the system. One example of a useful playback highlighting mode is to highlight each word (and only that word) as it is being spoken by the computer voice. The system plays back and reads aloud any text in the document, including, for example, the main story of a book, footnotes, chapter titles and also user-generated text notes that the system allows the user to type in. However, as noted herein, some sections or portions of text may be skipped, for example, the character names inside text tags, text indicated by use of the skip indicator, and other types of text as allowed by the system.
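
When stored audio is played, the word to highlight at a given playback position could, for example, be looked up from the word start times. The sketch below assumes the start times are available as a list sorted in ascending order; the function name `word_index_at` is hypothetical.

```python
from bisect import bisect_right

def word_index_at(start_times, t):
    """Return the index of the word being spoken at playback time t.

    start_times: ascending list of word start times, in seconds.
    Returns None if t falls before the first word starts.
    """
    i = bisect_right(start_times, t) - 1
    return i if i >= 0 else None
```

A user interface could call such a function periodically during playback and move the highlighting whenever the returned index changes.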

In some examples, the text can be rendered as a single document with a scroll bar or page advance button to view portions of the text that do not fit on a current page view, for example, text such as a word processor (e.g., Microsoft Word) document, a PDF document, or other electronic document. In some additional examples, the two-dimensional text can be used to generate a simulated three-dimensional book view as shown in FIG. 11.

Referring to FIGS. 12 and 13, a text that includes multiple pages can be formatted into the book view shown in FIG. 11 where two pages are arranged side-by-side and the pages are turned to reveal two new pages. Highlighting and association of different characters and voice models with different portions of the text can be used with both standard and book-view texts. In the case of a book-view text, the computer system includes page turn indicators which synchronize the turning of the page in the electronic book with the reading of the text in the electronic book. In order to generate the book-view from a document such as a Word or PDF document, the computer system uses the page break indicators in the two-dimensional document to determine the locations of the breaks between the pages. Page turn indicators are added to every other page of the book view.

A user may desire to share a document with the associated characters and voice models with another individual. In order to facilitate such sharing, the associations of a particular character with portions of a document and the character models for a particular document are stored with the document. When another individual opens the document, the associations between the assigned characters and different portions of the text are already included with the document.

Text-To-Speech (TTS) voice models associated with each character can be very large (e.g., from 15-250 Megabytes) and it may be undesirable to send the entire voice model with the document, especially if a document uses multiple voice models. In some embodiments, in order to eliminate the need to provide the voice model, the voice model is noted in the character definition and the system looks for the same voice model on the computer of the person receiving the document. If the voice model is available on the person's computer, the voice model is used. If the voice model is not available on the computer, metadata related to the original voice model such as gender, age, ethnicity, and language are used to select a different available voice model that is similar to the previously used voice model.
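
The voice model lookup and metadata-based fallback described above might be sketched as follows. The dictionary layout, the function name `select_voice`, and the scoring of similarity as a simple count of matching metadata fields are assumptions made for illustration.

```python
METADATA_KEYS = ("gender", "age", "ethnicity", "language")

def select_voice(wanted, installed):
    """Pick a voice model on the recipient's computer.

    wanted:    dict with a 'name' and optional metadata fields
    installed: list of dicts of the same shape
    Returns the same-named model if installed; otherwise the installed
    model sharing the most metadata fields, or None if none is installed.
    """
    for voice in installed:
        if voice["name"] == wanted["name"]:
            return voice

    def similarity(voice):
        # Count metadata fields that are present and equal in both models.
        return sum(
            1 for key in METADATA_KEYS
            if wanted.get(key) is not None and voice.get(key) == wanted.get(key)
        )

    return max(installed, key=similarity, default=None)
```

Other similarity measures, for example weighting language more heavily than age, could equally be used.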

In some additional examples, it can be beneficial to send all needed voice models with the document itself to reduce the likelihood that the recipient will not have appropriate voice models installed on their system to play the document. However, due to the size of the TTS voice models and of human voice-based voice models comprised of stored digitized audio, it can be prohibitive to send the entire voice model. As such, a subset of words (e.g., a subset of TTS generated words or a subset of the stored digitized audio of the human voice model) can be sent with the document where the subset of words includes only the words that are included in the documents. Because the number of unique words in a document is typically substantially less than all of the words in the English language, this can significantly reduce the size of the voice files sent to the recipient. For example, if a TTS speech generator is used, the TTS engine generates audio files (e.g., wave files) for words and those audio files are stored with the text so that it is not necessary to have the TTS engine installed on a machine to read the text. The number of audio files stored with the text can vary, for example, a full dictionary of audio files can be stored. In another example, only the unique audio files associated with words in the text are stored with the text. This allows the amount of memory necessary to store the audio files to be substantially less than if all words are stored. In other examples, where human voice-based voice models comprised of stored digitized audio are used to provide the narration of a text, either all of the words in the voice model can be stored with the text or only a subset of the words that appear in the text may be stored. Again, storing only the subset of words included in the text reduces the amount of memory needed to store the files.
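
Selecting only the audio needed for a particular document might, for example, be sketched as follows. The mapping `voice_clips` from a word to its stored audio file, and the function names, are hypothetical and are used only to illustrate the subset selection.

```python
import re

def unique_words(text):
    """Return the sorted set of distinct lower-cased words in the text."""
    return sorted(set(re.findall(r"[a-z]+", text.lower())))

def clips_to_send(text, voice_clips):
    """Keep only the audio clips for words that appear in the text.

    voice_clips: dict mapping a word to its audio file name.
    """
    return {w: voice_clips[w] for w in unique_words(text) if w in voice_clips}
```

In this sketch, repeated words such as "chin" contribute a single clip, which is the source of the size reduction described above.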

In some additional examples, only a subset of the voice models are sent to the recipient. For example, it might be assumed that the recipient will have at least one acceptable voice model installed on their computer. This voice model could be used for the narrator and only the voice models or the recorded speech for the characters other than the narrator would need to be sent to the recipient.

In some additional examples, in addition to associating voice models to read various portions of the text, a user can additionally associate sound effects with different portions of the text. For example, a user can select a particular place within the text at which a sound effect should occur and/or can select a portion of the text during which a particular sound effect such as music should be played. For example, if a script indicates that eerie music plays, a user can select those portions of the text and associate a music file (e.g., a wave file) of eerie music with the text. When the system reads the story, in addition to reading the text using an associated voice model (based on voice model highlighting), the system also plays the eerie music (based on the sound effect highlighting).

The systems and methods described herein can be implemented in digital electronic circuitry, or in computer hardware, firmware, software, web-enabled applications, or in combinations thereof. Data structures used to represent information can be stored in memory and in persistent storage. Apparatus of the invention can be implemented in a computer program product tangibly embodied in a machine-readable storage device for execution by a programmable processor and method actions can be performed by a programmable processor executing a program of instructions to perform functions of the invention by operating on input data and generating output. The invention can be implemented advantageously in one or more computer programs that are executable on a programmable system including at least one programmable processor coupled to receive data and instructions from, and to transmit data and instructions to, a data storage system, at least one input device, and at least one output device. Each computer program can be implemented in a high-level procedural or object oriented programming language, or in assembly or machine language if desired, and in any case, the language can be a compiled or interpreted language. Suitable processors include, by way of example, both general and special purpose microprocessors. Generally, a processor will receive instructions and data from a read-only memory and/or a random access memory. Generally, a computer will include one or more mass storage devices for storing data files; such devices include magnetic disks, such as internal hard disks and removable disks; magneto-optical disks; and optical disks. Storage devices suitable for tangibly embodying computer program instructions and data include all forms of non-volatile memory, including, by way of example, semiconductor memory devices, such as EPROM, EEPROM, and flash memory devices; magnetic disks such as internal hard disks and removable disks; magneto-optical disks; and CD-ROM disks. Any of the foregoing can be supplemented by, or incorporated in, ASICs (application-specific integrated circuits).

A portion of the disclosure of this patent document contains material which is subject to copyright protection (e.g., the copyrighted names mentioned herein). This material and the characters used herein are for exemplary purposes only. The characters are owned by their respective copyright owners.

Other implementations are within the scope of the following claims:

1. A computer implemented method comprising: applying speech recognition by one or more computer systems to an audio recording to generate a text version of recognized portions of text; determining by the one or more computer systems an elapsed time period from a reference time in the audio recording to each recognized portion in the audio recording; comparing by the one or more computer systems the recognized portions of text to expected portions of text; and generating by the one or more computer systems a timing file that is stored on a computer-readable storage medium, the timing file comprising the elapsed time information for each expected portion of text by: storing the elapsed time information for a recognized portion into the timing file if the recognized portion matches the corresponding expected portion of text; and otherwise computing the elapsed time information for the expected portion of text and storing the computed elapsed time information into the timing file if the recognized portion does not match the corresponding expected portion of text.
2. The method of claim 1, wherein the one or more recognized portions or expected portions of text comprise words.
3. The method of claim 1, further comprising, during play back: providing an audible output corresponding to the audio recording; and displaying a sequence of words corresponding to at least a portion of the expected portion of text on a user interface rendered on a display device and providing visual indicia indicating a correspondence between the audio recording and the expected portion of text.
4. The method of claim 1 wherein one or more of the recognized portions or the expected portions of text are syllables.
5. The method of claim 1 wherein computing further comprises: determining the number of syllables in the expected portion of text; determining the elapsed time for the determined number of syllables in the recognized portion, and outputting the determined elapsed time to the timing file.
6. The method of claim 1 wherein computing further comprises: determining the elapsed time for an expected portion of text based on a metric associated with an expected length of time to verbalize the expected portion of text.
7. The method of claim 1 wherein computing comprises: displaying on a user interface device, the recognized portions of text, the elapsed times, and the expected portions of text; receiving from a user an indication of timings for the expected portions of text; and storing elapsed time information in the timing file based on the received user indications.
8. A computer program product residing on a computer readable medium, the computer program product comprising instructions for causing a processor to: apply speech recognition to an audio recording to generate a text version of recognized portions of text; determine an elapsed time period from a reference time in the audio recording to each recognized portion in the audio recording; generate a timing file that is stored on a computer-readable storage medium, the timing file comprising the elapsed time information for each expected portion of text by storing the elapsed time information for a recognized portion into the word timing file if the recognized portion matches the corresponding expected portion of text, and otherwise computing the elapsed time information for the expected portion of text and storing the computed elapsed time information into the timing file if the recognized portion does not match the expected portion of text.
9. The computer program product of claim 8, wherein the one or more recognized portions or portions of text comprise words.
10. The computer program product of claim 8 wherein the one or more of the recognized portions or portions of text comprise syllables.
11. The computer program product of claim 8, further comprising, during playback: provide an audible output corresponding to the audio recording; display a sequence of words corresponding to at least a portion of the expected portion of text on a user interface rendered on a display device; and provide visual indicia indicating a correspondence between the portions in the audio recording and the expected portion of text.
12. The computer program product of claim 8 wherein the instructions to compute the elapsed time information further comprise instructions to: determine the elapsed time for an expected portion of text based on a metric associated with an expected length of time to verbalize the expected portion of text.
13. The computer program product of claim 8 wherein the instructions to compute the elapsed time information comprise instructions to: display on a user interface device, the recognized portions of text, the elapsed times, and the expected portions of text; receive from a user an indication of timings for the expected portions of text; and store elapsed time information in the timing file based on the received user indications.
14. A system comprising: a memory; and a computing device configured to: apply speech recognition to an audio recording to generate a text version of recognized portions of text; determine an elapsed time period from a reference time in the audio recording to each recognized portion in the audio recording version; generate a timing file that is stored on a computer-readable storage medium, the timing file comprising the elapsed time information for each expected portion of text by storing the elapsed time information for a recognized portion into the timing file if the recognized portion matches the corresponding expected portion of text, and otherwise computing the elapsed time information for the expected portion of text and storing the computed elapsed time information into the timing file word if the recognized portion does not match the expected portion of text.
15. The system of claim 14, wherein the one or more recognized portions or portions of text comprise words.
16. The system of claim 14, wherein the one or more recognized portions or portions of text comprise syllables.
17. The system of claim 14, wherein the computing device is further configured to, during playback: provide an audible output corresponding to the audio recording; display a sequence of words corresponding to at least a portion of the expected portion of text on a user interface rendered on a display device; and provide visual indicia indicating a correspondence between the portions in the audio recording and the expected portion of text.
18. The system of claim 14, wherein the computing device is further configured to: determine the elapsed time for an expected portion of text based on a metric associated with an expected length of time to verbalize the expected portion of text.
19. The system of claim 14, wherein the computing device is further configured to: display on a user interface device, the recognized portions of text, the elapsed times, and the expected portions of text; receive from a user an indication of timings for the expected portions of text; and store elapsed time information in the timing file based on the received user indications.